
    FPGA-based high-performance neural network acceleration

    In the last ten years, Artificial Intelligence through Deep Neural Networks (DNNs) has penetrated virtually every aspect of science, technology, and business. Advances are rapid, with thousands of papers published annually. Many types of DNNs have been and continue to be developed -- in this thesis, we address Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs) -- each with a different set of target applications and implementation challenges. The overall problem for all of these Neural Networks (NNs) is that their target applications generally pose stringent constraints on latency and throughput, but also have strict accuracy requirements. Much research has therefore gone into all aspects of improving NN quality and performance: algorithms, code optimization, acceleration with GPUs, and acceleration with hardware, both dedicated ASICs and off-the-shelf FPGAs. In this thesis, we concentrate on the last of these approaches.

    There have been many previous efforts to create hardware that accelerates NNs. The problem designers face is that optimal NN models typically have significant irregularities, making them hardware unfriendly. One commonly used approach is to train NN models to follow regular computation and data patterns. This approach, however, can hurt the models' accuracy or lead to models with non-negligible redundancies. This dissertation takes a different approach: instead of regularizing the model, we create architectures friendly to irregular models. Our thesis is that high-accuracy and high-performance NN inference and training can be achieved by creating a series of novel irregularity-aware architectures for Field-Programmable Gate Arrays (FPGAs). In four studies on four different NN types, we find that this approach results in speedups of 2.1x to 3255x compared with carefully selected prior art; for inference, there is no change in accuracy. The bulk of this dissertation revolves around these studies, the various workload balancing techniques, and the resulting NN acceleration architectures. In particular, we propose four architectures to handle, respectively, data-structure-level, operation-level, bit-level, and model-level irregularities.

    At the data structure level, we propose AWB-GCN, which uses runtime workload rebalancing to handle Sparse Matrix Multiplication (SpMM) on extremely sparse and unbalanced input. With GNN inference as a case study, AWB-GCN achieves over 90% system efficiency, guarantees efficient off-chip memory access, and provides considerable speedups over CPUs (3255x), GPUs (80x), and a prior ASIC accelerator (5.1x). At the operation level, we propose O3BNN-R, which can detect redundant operations and prune them at run time, even those that are highly data-dependent and unpredictable. With Binarized NNs (BNNs) as a case study, O3BNN-R can prune over 30% of the operations, without any accuracy loss, yielding speedups over state-of-the-art implementations on CPUs (1122x), GPUs (2.3x), and FPGAs (2.1x). At the bit level, we propose CQNN, which embeds a Coarse-Grained Reconfigurable Architecture (CGRA) that can be programmed at runtime to support NN functions with various data-width requirements. Results show that CQNN can deliver µs-level Quantized NN (QNN) inference. At the model level, we propose FPDeep, especially for training. To address model-level irregularity, FPDeep uses a novel model partitioning scheme to balance workload and storage among nodes. By using a hybrid of model and layer parallelism to train DNNs, FPDeep avoids the large gap that commonly occurs between training and testing accuracy due to improper convergence to sharp minimizers (caused by large training batches). Results show that FPDeep provides scalable, fast, and accurate training and delivers 6.6x higher energy efficiency than GPUs.
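    To make the data-structure-level problem concrete, here is a minimal Python sketch -- not the dissertation's hardware design -- of the row-level imbalance that AWB-GCN's runtime rebalancing targets: under a static row-to-PE assignment, the few very dense rows of a power-law sparse matrix leave most processing elements idle, while redistributing rows by nonzero count evens out the work. All names and parameters below are illustrative.

    import random

    def simulate(num_rows=1024, num_pes=32, seed=0):
        """Compare static vs. load-aware row-to-PE assignment for SpMM.

        Work per row is approximated by its nonzero count; a power-law
        distribution mimics the extreme sparsity imbalance of GCN inputs.
        """
        rng = random.Random(seed)
        # Power-law-ish nonzero counts: a few very dense rows, many near-empty ones.
        nnz = [max(1, int(rng.paretovariate(1.5))) for _ in range(num_rows)]

        # Static assignment: round-robin rows across PEs (no load awareness).
        static = [0] * num_pes
        for i, n in enumerate(nnz):
            static[i % num_pes] += n

        # Rebalanced assignment: greedily give each row to the least-loaded PE,
        # a software stand-in for hardware-side runtime workload rebalancing.
        balanced = [0] * num_pes
        for n in sorted(nnz, reverse=True):
            balanced[balanced.index(min(balanced))] += n

        mean = sum(nnz) / num_pes
        # Efficiency = mean load / max load: the slowest PE bounds each round.
        print(f"static efficiency:     {mean / max(static):.2%}")
        print(f"rebalanced efficiency: {mean / max(balanced):.2%}")

    simulate()

    In AWB-GCN itself the redistribution happens in hardware and continuously at runtime; the single greedy pass here only illustrates why load-aware assignment recovers most of the efficiency that a static mapping loses.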

    Online Evaluation of Audiences for Targeted Advertising via Bandit Experiments

    Firms implementing digital advertising campaigns face a complex problem in determining the right match between their advertising creatives and target audiences. Typical solutions to the problem have leveraged non-experimental methods, or used "split-testing" strategies that have not explicitly addressed the complexities induced by targeted audiences that can potentially overlap with one another. This paper presents an adaptive algorithm that addresses the problem via online experimentation. The algorithm is set up as a contextual bandit and addresses the overlap issue by partitioning the target audiences into disjoint, non-overlapping sub-populations. It learns an optimal creative display policy in the disjoint space, while assessing in parallel which creative has the best match in the space of possibly overlapping target audiences. Experiments show that the proposed method is more efficient than naive "split-testing" or non-adaptive "A/B/n" testing methods. We also describe a testing product we built that uses the algorithm. The product is currently deployed on the advertising platform of JD.com, an eCommerce company and a publisher of digital ads in China.
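    As a rough illustration of the partitioning idea, the Python sketch below (hypothetical audiences and click rates, not JD.com's production system) maps each user to a disjoint cell keyed by its exact set of audience memberships, then runs Beta-Bernoulli Thompson sampling per cell, so overlapping audiences never contaminate each other's estimates.

    import random
    from collections import defaultdict

    rng = random.Random(0)

    # Hypothetical overlapping target audiences: audience name -> user ids.
    AUDIENCES = {"sports_fans": set(range(0, 600)), "new_parents": set(range(400, 1000))}
    CREATIVES = ["video", "banner"]

    def cell_of(user):
        """Disjoint cell = the exact set of audiences a user belongs to."""
        return frozenset(a for a, members in AUDIENCES.items() if user in members)

    # Simulated ground-truth click rates per (cell, creative) -- illustration only.
    TRUE_CTR = {
        (frozenset({"sports_fans"}), "video"): 0.08,
        (frozenset({"sports_fans"}), "banner"): 0.03,
        (frozenset({"new_parents"}), "video"): 0.02,
        (frozenset({"new_parents"}), "banner"): 0.07,
        (frozenset({"sports_fans", "new_parents"}), "video"): 0.05,
        (frozenset({"sports_fans", "new_parents"}), "banner"): 0.05,
    }

    # Beta-Bernoulli Thompson sampling: one arm per creative within each cell.
    wins = defaultdict(lambda: 1)    # Beta prior alpha = 1
    losses = defaultdict(lambda: 1)  # Beta prior beta = 1

    for _ in range(20000):
        user = rng.randrange(1000)
        cell = cell_of(user)
        # Sample a plausible CTR for each creative and show the argmax.
        choice = max(CREATIVES,
                     key=lambda c: rng.betavariate(wins[(cell, c)], losses[(cell, c)]))
        clicked = rng.random() < TRUE_CTR[(cell, choice)]
        (wins if clicked else losses)[(cell, choice)] += 1

    # Read the learned policy back out per disjoint cell.
    for cell in {cell_of(u) for u in range(1000)}:
        est = {c: wins[(cell, c)] / (wins[(cell, c)] + losses[(cell, c)]) for c in CREATIVES}
        print(sorted(cell), est)

    Per-audience comparisons in the original, overlapping space can then be recovered by aggregating the cell-level estimates weighted by cell sizes, which is what lets the algorithm assess creatives against the overlapping target audiences in parallel.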

    DPSA: Dense pixelwise spatial attention network for hatching egg fertility detection

    Deep convolutional neural networks show good prospects for the fertility detection and classification of specific pathogen-free hatching egg embryos in the production of avian influenza vaccine, and our previous work mainly investigated three factors of networks to push performance: depth, width, and cardinality. However, an important problem remains: feeble embryos with weak blood vessels interfere with the classification of resilient fertile ones. Inspired by fine-grained classification, we introduce the attention mechanism into our model by proposing a dense pixelwise spatial attention module, combined with the existing channel attention through depthwise separable convolutions, to further enhance the network's class-discriminative ability. In our fused attention module, depthwise convolutions are used for channel-specific feature learning, and dilated convolutions with different sampling rates are adopted to capture spatial multiscale context and preserve rich detail, which maintains high resolution and increases receptive fields simultaneously. The attention mask with strong semantic information, generated by aggregating the outputs of the spatial pyramid of dilated convolutions, is broadcast to low-level features via elementwise multiplication, serving as a feature selector to emphasize informative features and suppress less useful ones. A series of experiments conducted on our hatching egg dataset shows that our attention network achieves a lower misjudgment rate on weak embryos and a more stable accuracy, up to 98.3% and 99.1% on 5-day- and 9-day-old eggs, respectively.
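    The fused attention mechanism described above can be sketched in a few lines of PyTorch. This is not the authors' exact DPSA module; the kernel sizes, dilation rates, and widths below are assumptions chosen only to show the structure: a dilated spatial pyramid producing a pixelwise mask, a depthwise branch producing channel weights, and broadcast elementwise multiplication applying both to the input features.

    import torch
    import torch.nn as nn

    class FusedAttention(nn.Module):
        """Sketch of a fused spatial/channel attention block in the spirit of DPSA.

        A pyramid of dilated 3x3 convolutions captures multiscale spatial context
        at full resolution; their aggregate is squashed to a pixelwise mask that
        reweights the input. A depthwise (channel-specific) branch feeds a
        channel gate. All hyperparameters here are illustrative assumptions.
        """

        def __init__(self, channels, rates=(1, 2, 4)):
            super().__init__()
            # Spatial pyramid: padding == dilation keeps H x W while the
            # dilation rate grows the receptive field.
            self.pyramid = nn.ModuleList(
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates
            )
            self.to_mask = nn.Conv2d(channels * len(rates), 1, 1)  # pixelwise mask
            # Depthwise conv (groups=channels) learns per-channel responses;
            # global pooling + pointwise conv turns them into channel weights.
            self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
            self.channel_gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid()
            )

        def forward(self, x):
            spatial = torch.cat([branch(x) for branch in self.pyramid], dim=1)
            mask = torch.sigmoid(self.to_mask(spatial))    # (N, 1, H, W)
            gate = self.channel_gate(self.depthwise(x))    # (N, C, 1, 1)
            # Broadcast elementwise multiplication: the mask selects informative
            # pixels, the gate selects informative channels.
            return x * mask * gate

    feats = torch.randn(2, 64, 32, 32)
    print(FusedAttention(64)(feats).shape)  # torch.Size([2, 64, 32, 32])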

    High-Throughput Computational Screening of Two-Dimensional Semiconductors

    By performing high-throughput calculations using density functional theory combined with a semiempirical van der Waals dispersion correction, we screen 97 direct- and 253 indirect-gap two-dimensional nonmagnetic semiconductors from nearly 1000 monolayers according to energetic, thermodynamic, mechanical, and dynamic stability criteria. We present the calculated results, including lattice constants, formation energy, Young's modulus, Poisson's ratio, shear modulus, band gap, band structure, ionization energy, and electron affinity, for all the candidates satisfying our criteria.
    Comment: 12 pages, 11 figures
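    The screening described above amounts to a sequence of stability filters applied to computed properties. Below is a hypothetical Python sketch of such post-processing; the field names, thresholds, and example entries are illustrative assumptions, not the paper's actual criteria or data.

    def is_stable(m):
        """Keep a monolayer only if it passes every stability filter."""
        return (
            m["formation_energy_eV_per_atom"] < 0.0      # energetic stability
            and m["hull_distance_eV_per_atom"] < 0.2     # thermodynamic stability
            and m["min_phonon_frequency_THz"] >= 0.0     # dynamic (no imaginary modes)
            and m["elastic_tensor_positive_definite"]    # mechanical (Born criteria)
        )

    monolayers = [
        {"name": "MoS2", "formation_energy_eV_per_atom": -0.9,
         "hull_distance_eV_per_atom": 0.0, "min_phonon_frequency_THz": 0.0,
         "elastic_tensor_positive_definite": True, "gap_eV": 1.7, "direct_gap": True},
        {"name": "unstable_X", "formation_energy_eV_per_atom": 0.4,
         "hull_distance_eV_per_atom": 0.5, "min_phonon_frequency_THz": -1.2,
         "elastic_tensor_positive_definite": False, "gap_eV": 0.8, "direct_gap": False},
    ]

    # Survivors are split into direct- and indirect-gap semiconductors.
    survivors = [m for m in monolayers if is_stable(m) and m["gap_eV"] > 0]
    direct = [m["name"] for m in survivors if m["direct_gap"]]
    indirect = [m["name"] for m in survivors if not m["direct_gap"]]
    print("direct-gap:", direct, "indirect-gap:", indirect)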